Planning in Reward-Rich Domains via PAC Bandits
Authors
Abstract
In some decision-making environments, successful solutions are common. If the evaluation of candidate solutions is noisy, however, the challenge is knowing when a “good enough” answer has been found. We formalize this problem as an infinite-armed bandit and provide upper and lower bounds on the number of evaluations or “pulls” needed to identify a solution whose evaluation exceeds a given threshold r0. We present several algorithms and use them to identify reliable strategies for solving screens from the video games Infinite Mario and Pitfall! We show order of magnitude improvements in sample complexity over a natural approach that pulls each arm until a good estimate of its success probability is known.
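As a rough illustration of the problem setup (not the paper's algorithm), the sketch below samples candidate arms one at a time and uses Hoeffding confidence bounds on Bernoulli pulls to decide whether an arm's success probability is confidently above the threshold r0 or confidently below it. The helpers sample_arm and pull are hypothetical stand-ins for the game environment.

import math

def hoeffding_radius(n, delta):
    # Half-width of a (1 - delta) confidence interval after n Bernoulli pulls.
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))

def find_good_arm(sample_arm, pull, r0, eps, delta, max_pulls_per_arm=10000):
    # Keep drawing fresh arms; accept the first arm whose lower confidence bound
    # clears r0 - eps, and abandon any arm whose upper bound falls below r0.
    while True:
        arm = sample_arm()
        successes, n = 0, 0
        while n < max_pulls_per_arm:
            successes += pull(arm)   # noisy 0/1 evaluation of the candidate strategy
            n += 1
            radius = hoeffding_radius(n, delta)
            mean = successes / n
            if mean - radius >= r0 - eps:   # confidently good enough: stop
                return arm
            if mean + radius < r0:          # confidently below threshold: discard arm
                break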
Similar Papers
PAC-Bayesian Aggregation and Multi-Armed Bandits (Habilitation à Diriger des Recherches, Université Paris-Est)
PAC Bandits with Risk Constraints
We study the problem of best arm identification with risk constraints within the setting of fixed confidence pure exploration bandits (PAC bandits). The goal is to stop as fast as possible and, with high confidence, return an arm whose mean is ε-close to the best arm among those that satisfy a risk constraint, namely that their α-quantile functions are larger than a threshold β. For this risk-sensitiv...
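A minimal sketch of the feasibility check described here, assuming per-arm samples have already been collected (the paper's stopping rule is not shown): keep only arms whose empirical α-quantile exceeds β, then return the feasible arm with the largest empirical mean.

import numpy as np

def best_feasible_arm(samples, alpha, beta):
    # samples: dict mapping arm id -> 1-D array of observed rewards.
    best_arm, best_mean = None, float("-inf")
    for arm, rewards in samples.items():
        if np.quantile(rewards, alpha) > beta:   # risk constraint satisfied
            mean = float(np.mean(rewards))
            if mean > best_mean:
                best_arm, best_mean = arm, mean
    return best_arm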
Skyline Identification in Multi-Armed Bandits
We introduce a variant of the classical PAC multi-armed bandit problem. There is an ordered set of n arms A[1], …, A[n], each with some stochastic reward drawn from some unknown bounded distribution. The goal is to identify the skyline of the set A, consisting of all arms A[i] such that A[i] has larger expected reward than all lower-numbered arms A[1], …, A[i−1]. We define a natural ...
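The skyline itself is just the set of prefix maxima; a small sketch with known means (in the bandit setting these would have to be estimated from pulls):

def skyline(means):
    # Return indices i (0-based) such that means[i] exceeds every earlier mean.
    result, running_max = [], float("-inf")
    for i, m in enumerate(means):
        if m > running_max:
            result.append(i)
            running_max = m
    return result

# Example: skyline([0.2, 0.5, 0.4, 0.7]) -> [0, 1, 3]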
Multi-Armed Bandits, Gittins Index, and Its Calculation
Multi-armed bandit is a colorful term that refers to the dilemma faced by a gambler playing in a casino with multiple slot machines (which were colloquially called one-armed bandits). What strategy should a gambler use to pick the machine to play next? Is it the one for which the posterior mean of winning is the highest, thereby maximizing current expected reward, or the one for which the ...
Modal Bandits
Analyses of multi-armed bandits primarily presume that the value of an arm is its expected reward. We introduce a theory for multi-armed bandits where the values are the modes of the reward distributions.
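A toy sketch of the modal view for discrete rewards, assuming per-arm samples are already available (the paper's estimators and analysis are not shown): score each arm by its most frequent observed reward rather than its average.

from collections import Counter

def empirical_mode(rewards):
    # Most frequently observed reward value.
    return Counter(rewards).most_common(1)[0][0]

def best_modal_arm(samples):
    # samples: dict mapping arm id -> list of observed (discrete) rewards.
    return max(samples, key=lambda arm: empirical_mode(samples[arm]))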